SELFormer: Molecular Representation Learning via SELFIES Language Models

Abstract

Automated computational analysis of the vast chemical space is critical for numerous fields of research, such as drug discovery and material science. Representation learning techniques have recently been employed with the primary objective of generating compact and informative numerical expressions of complex data, for efficient usage in subsequent prediction tasks. One approach to efficiently learning molecular representations is processing string-based notations of chemicals via natural language processing (NLP) algorithms. The majority of the methods proposed so far utilize SMILES notations for this purpose, which is the most extensively used string-based encoding for molecules. However, SMILES is associated with problems related to validity and robustness, which may prevent the model from effectively uncovering the knowledge hidden in the data. In this study, we propose SELFormer, a transformer architecture-based chemical language model (CLM) that utilizes a 100% valid, compact and expressive notation, SELFIES, as input, in order to learn flexible and high-quality molecular representations. SELFormer is pre-trained on two million drug-like compounds and fine-tuned for diverse molecular property prediction tasks. Our performance evaluation has revealed that SELFormer outperforms all competing methods, including graph learning-based approaches and SMILES-based CLMs, on predicting the aqueous solubility of molecules and adverse drug reactions, while producing comparable results for the remaining tasks. We also visualized the molecular representations learned by SELFormer via dimensionality reduction, which indicated that even the pre-trained model can discriminate molecules with differing structural properties. We share SELFormer as a programmatic tool, together with its datasets and pre-trained models, at https://github.com/HUBioDataLab/SELFormer. Overall, our research demonstrates the benefit of using the SELFIES notation in the context of chemical language modeling and opens up new possibilities for the design and discovery of novel drug candidates with desired features.
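The validity problem the abstract attributes to SMILES can be illustrated with a minimal sketch. The checker below is our own simplification for illustration (it is not part of SELFormer and ignores most of SMILES grammar): a SMILES string must, among other constraints, keep parentheses balanced and ring-closure digits paired, so small edits easily produce syntactically invalid strings. SELFIES avoids this failure mode by construction, since every syntactically possible SELFIES string decodes to a valid molecule.

```python
def smiles_syntax_ok(smiles: str) -> bool:
    """Minimal syntactic check on two SMILES constraints:
    balanced '(' / ')' branches and paired ring-closure digits.
    (A real parser checks much more; this is only illustrative.)"""
    depth = 0
    ring_digits = {}
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # closing a branch that was never opened
                return False
        elif ch.isdigit():  # ring-closure labels must occur in pairs
            ring_digits[ch] = ring_digits.get(ch, 0) + 1
    if depth != 0:  # unclosed branch
        return False
    return all(count % 2 == 0 for count in ring_digits.values())

benzene = "c1ccccc1"
print(smiles_syntax_ok(benzene))       # True: well-formed
print(smiles_syntax_ok(benzene[:-1]))  # False: dangling ring-closure '1'
```

Deleting a single character from a valid SMILES string is enough to break it, which is exactly the fragility that makes generative modeling over raw SMILES error-prone.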


Similar Resources

The Effect of Visual Representation, Textual Representation, and Glossing on Second Language Vocabulary Learning

In this study, the researcher chose three different vocabulary techniques (Visual Representation, Textual Enhancement, and Glossing) and compared them with the traditional method of teaching vocabulary. Eighty advanced EFL learners were assigned to four intact groups (three experimental and one control) using a proficiency test and a vocabulary test as a pre-test. In the visual group, stu...

Online Representation Learning in Recurrent Neural Language Models

We investigate an extension of continuous online learning in recurrent neural network language models. The model keeps a separate vector representation of the current unit of text being processed and adaptively adjusts it after each prediction. The initial experiments give promising results, indicating that the method is able to increase language modelling accuracy, while also decreasing the pa...

Learning Via Compact Data Representation

We present an unsupervised learning methodology derived from compact data encoding and demonstrate how to construct models of polysemy, priming, semantic disambiguation and learning using this theoretical basis. The model is capable of simulating human-like performance on artificial grammar learning.

Deblocking Joint Photographic Experts Group Compressed Images via Self-learning Sparse Representation

JPEG is one of the most widely used image compression methods, but it causes annoying blocking artifacts at low bit-rates. Sparse representation is an efficient technique that can solve many inverse problems in image processing applications, such as denoising and deblocking. In this paper, a post-processing method is proposed for reducing JPEG blocking effects via sparse representation. In this ...

Deep Unsupervised Domain Adaptation for Image Classification via Low Rank Representation Learning

Domain adaptation is a powerful technique when a large amount of labeled data with similar attributes is available across different domains. In real-world applications, there is a huge amount of data, but most of it is unlabeled. Domain adaptation is effective in image classification, where obtaining adequately labeled data is expensive and time-consuming. We propose a novel method named DALRRL, which consists of deep ...


Journal

Journal title: Machine Learning: Science and Technology

Year: 2023

ISSN: 2632-2153

DOI: https://doi.org/10.1088/2632-2153/acdb30